Name: Parin Bhaduri
ID: pbb62
The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.
import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
Activating project at `~/Library/CloudStorage/OneDrive-CornellUniversity/Coding-Assignments/BEE Climate Risk Analysis/Homeworks/hw4-parinbhaduri`
using Random
using StatsBase
using CSV # load CSV data
using Dates
using DataFrames # data storage and presentation
using DataFramesMeta
using Plots # plotting library
using StatsPlots # statistical plotting
using Distributions # statistical distribution interface
using Turing # probabilistic programming and MCMC
using Optim # optimization library
Random.seed!(1);
Next, let's load and process the data (as we did in Lab 09). This notebook uses hourly data from the San Francisco, CA tide gauge station, obtained from the University of Hawaii Sea Level Center (Caldwell et al., 2015).
## load the data from the file and return a DataFrame of DateTime values and gauge measurements
function load_data(fname)
date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
# This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
df = @chain fname begin
CSV.read(DataFrame; header=false)
rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
# need to reformat the decimal date in the data file
@transform :datetime = DateTime.(:year, :month, :day, :hour)
# replace -99999 with missing
@transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
select(:datetime, :gauge)
end
return df
end
dat = load_data("data/h551.csv")
| Row | datetime | gauge |
|---|---|---|
| | DateTime | Int64? |
| 1 | 1897-08-01T08:00:00 | 3292 |
| 2 | 1897-08-01T09:00:00 | 3322 |
| 3 | 1897-08-01T10:00:00 | 3139 |
| 4 | 1897-08-01T11:00:00 | 2835 |
| 5 | 1897-08-01T12:00:00 | 2377 |
| 6 | 1897-08-01T13:00:00 | 2012 |
| 7 | 1897-08-01T14:00:00 | 1768 |
| 8 | 1897-08-01T15:00:00 | 1646 |
| 9 | 1897-08-01T16:00:00 | 1737 |
| 10 | 1897-08-01T17:00:00 | 1951 |
| 11 | 1897-08-01T18:00:00 | 2286 |
| 12 | 1897-08-01T19:00:00 | 2621 |
| 13 | 1897-08-01T20:00:00 | 2957 |
| ⋮ | ⋮ | ⋮ |
| 1100081 | 2023-01-31T12:00:00 | 3180 |
| 1100082 | 2023-01-31T13:00:00 | 3365 |
| 1100083 | 2023-01-31T14:00:00 | 3515 |
| 1100084 | 2023-01-31T15:00:00 | 3606 |
| 1100085 | 2023-01-31T16:00:00 | 3593 |
| 1100086 | 2023-01-31T17:00:00 | 3414 |
| 1100087 | 2023-01-31T18:00:00 | 3125 |
| 1100088 | 2023-01-31T19:00:00 | 2700 |
| 1100089 | 2023-01-31T20:00:00 | 2315 |
| 1100090 | 2023-01-31T21:00:00 | 2047 |
| 1100091 | 2023-01-31T22:00:00 | 1922 |
| 1100092 | 2023-01-31T23:00:00 | 1913 |
## detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))
| Row | datetime | residual |
|---|---|---|
| | DateTime | Float64? |
| 1 | 1897-08-08T23:00:00 | 140.967 |
| 2 | 1897-08-09T00:00:00 | 357.455 |
| 3 | 1897-08-09T01:00:00 | 574.109 |
| 4 | 1897-08-09T02:00:00 | 790.847 |
| 5 | 1897-08-09T03:00:00 | 975.095 |
| 6 | 1897-08-09T04:00:00 | 943.264 |
| 7 | 1897-08-09T05:00:00 | 818.689 |
| 8 | 1897-08-09T06:00:00 | 479.869 |
| 9 | 1897-08-09T07:00:00 | -42.6158 |
| 10 | 1897-08-09T08:00:00 | -503.768 |
| 11 | 1897-08-09T09:00:00 | -872.839 |
| 12 | 1897-08-09T10:00:00 | -1118.25 |
| 13 | 1897-08-09T11:00:00 | -1269.75 |
| ⋮ | ⋮ | ⋮ |
| 1099715 | 2023-01-23T21:00:00 | 925.384 |
| 1099716 | 2023-01-23T22:00:00 | 563.422 |
| 1099717 | 2023-01-23T23:00:00 | -46.0163 |
| 1099718 | 2023-01-24T00:00:00 | -656.632 |
| 1099719 | 2023-01-24T01:00:00 | -1119.84 |
| 1099720 | 2023-01-24T02:00:00 | -1397.12 |
| 1099721 | 2023-01-24T03:00:00 | -1425.0 |
| 1099722 | 2023-01-24T04:00:00 | -1289.18 |
| 1099723 | 2023-01-24T05:00:00 | -961.886 |
| 1099724 | 2023-01-24T06:00:00 | -540.733 |
| 1099725 | 2023-01-24T07:00:00 | -120.428 |
| 1099726 | 2023-01-24T08:00:00 | 228.251 |
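As a quick sanity check, the centered moving average used above can be exercised on a toy series (a minimal sketch; `n` is the half-window, matching `ma_offset`, so the first and last `n` points are dropped):

```julia
using Statistics

# same definition as in the detrending step: the window at index i spans
# series[i-n:i+n], so the output is 2n shorter than the input
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]

moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 1)  # → [2.0, 3.0, 4.0]
```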
## group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(df -> df[argmax(df.residual), :],
                     groupby(transform(dat_ma, :datetime => x -> year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
| Row | Year | residual |
|---|---|---|
| | Int64 | Float64 |
| 1 | 1897 | 1247.5 |
| 2 | 1898 | 1237.02 |
| 3 | 1899 | 1471.5 |
| 4 | 1900 | 1276.62 |
| 5 | 1901 | 1282.43 |
| 6 | 1902 | 1165.63 |
| 7 | 1903 | 1180.59 |
| 8 | 1904 | 1221.89 |
| 9 | 1905 | 1240.64 |
| 10 | 1906 | 1175.34 |
| 11 | 1907 | 1295.24 |
| 12 | 1908 | 1360.57 |
| 13 | 1909 | 1216.81 |
| ⋮ | ⋮ | ⋮ |
| 115 | 2011 | 1348.5 |
| 116 | 2012 | 1298.43 |
| 117 | 2013 | 1303.97 |
| 118 | 2014 | 1264.41 |
| 119 | 2015 | 1256.0 |
| 120 | 2016 | 1355.96 |
| 121 | 2017 | 1261.74 |
| 122 | 2018 | 1273.0 |
| 123 | 2019 | 1251.56 |
| 124 | 2020 | 1273.07 |
| 125 | 2021 | 1227.76 |
| 126 | 2022 | 1320.36 |
In this problem, you will fit a stationary GEV model and compute the Akaike and Deviance Information Criteria. You will also look at relevant graphical checks.
Construct a Turing.jl stationary model with appropriate priors (there's no right choice; remember that thinking about the priors is part of the model checking process). Find the MLE estimate (you'll need this for the AIC) and use MCMC to sample from the posterior.
tide_hist = @df dat_annmax histogram(:residual)
xaxis!(tide_hist, "Annual Maxima")
yaxis!(tide_hist, "Count")
@model function gev_annmax(y)
## pick priors
μ ~ Normal(1250, 200) # location
σ ~ truncated(Normal(0, 100); lower=0) # scale
ξ ~ Normal(0, 10) # shape; pay special attention to this one, and you may need to experiment
## likelihood
for t in 1:length(y)
y[t] ~ GeneralizedExtremeValue(μ, σ, ξ)
end
end
gev_annmax (generic function with 2 methods)
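Before sampling, a quick prior predictive check helps judge these prior choices (a minimal sketch using the same priors as the model above; note that the very wide `Normal(0, 10)` shape prior puts substantial mass on |ξ| > 1, which implies either wildly heavy-tailed or sharply bounded annual maxima):

```julia
using Distributions, Random

Random.seed!(1)

# draw parameter sets from the priors and simulate one annual maximum each;
# the prior choices here mirror the model above, not fitted values
n_draws = 1000
prior_draws = map(1:n_draws) do _
    μ = rand(Normal(1250, 200))
    σ = rand(truncated(Normal(0, 100); lower=0))
    ξ = rand(Normal(0, 10))
    rand(GeneralizedExtremeValue(μ, σ, ξ))
end

# the range of simulated maxima; a tighter shape prior such as Normal(0, 0.2)
# concentrates this prior predictive on physically plausible tide levels
extrema(prior_draws)
```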
gev_model = gev_annmax(dat_annmax[!, :residual])
DynamicPPL.Model{typeof(gev_annmax), (:y,), (), (), Tuple{Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(gev_annmax, (y = [1247.5013623978202, 1237.0217983651228, 1471.4986376021798, 1276.6158038147137, 1282.4277929155314, 1165.632152588556, 1180.5858310626704, 1221.8882833787466, 1240.6376021798364, 1175.3433242506812 … 1303.9727520435968, 1264.4059945504086, 1256.0027247956405, 1355.9618528610354, 1261.7356948228885, 1273.0027247956405, 1251.5558583106267, 1273.0653950953679, 1227.7574931880108, 1320.359673024523],), NamedTuple(), DynamicPPL.DefaultContext())
MLE_est = optimize(gev_model, MLE(), Optim.Options(iterations=10_000, allow_f_increases=true))
ModeResult with maximized lp of -Inf [1261.1648013797742, 12.799903861288488, 10.814922793760273]
chain = sample(gev_model, NUTS(), MCMCThreads(), 5000, 4, drop_warmup=true);
@show chain
┌ Info: Found initial step size (ϵ = 0.2, 0.4, 0.4, and 0.025 across the four chains)
└ @ Turing.Inference /Users/pbb62/.julia/packages/Turing/Suzsv/src/inference/hmc.jl:190
Sampling (4 threads) 100%|██████████████████████████████| Time: 0:00:23
chain = MCMC chain (5000×15×4 Array{Float64, 3})
Chains MCMC chain (5000×15×4 Array{Float64, 3}):
Iterations = 1001:1:6000
Number of chains = 4
Samples per chain = 5000
Wall duration = 10.58 seconds
Compute duration = 37.02 seconds
parameters = μ, σ, ξ
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rh ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float ⋯
μ 1258.5113 5.6784 0.0476 14264.0671 13708.9539 1.00 ⋯
σ 57.3039 4.2055 0.0384 12058.9525 12988.4898 1.00 ⋯
ξ 0.0284 0.0630 0.0007 7799.7675 7770.7863 1.00 ⋯
2 columns omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
μ 1247.5861 1254.6083 1258.4304 1262.3024 1269.7907
σ 49.6903 54.3675 57.0798 60.0217 66.1480
ξ -0.0820 -0.0165 0.0241 0.0692 0.1608
plot(chain)
corner(chain)
Compute AIC and DIC for the stationary model.
##Calculate DIC for MCMC
#Calculate parameter averages
post_means = mean(chain)[:, :mean]
log_avg = sum([logpdf(GeneralizedExtremeValue(post_means[1], post_means[2], post_means[3]), y) for y in dat_annmax.residual])
#Calculate average log-likelihood
μ = Array(group(chain, :μ))
σ = Array(group(chain, :σ))
ξ = Array(group(chain, :ξ))
log_like = zeros(length(μ))
for step in 1:length(log_like)
log_like[step] = sum([logpdf(GeneralizedExtremeValue(μ[step], σ[step], ξ[step]), y) for y in dat_annmax.residual])
end
avg_log = mean(log_like)
-709.1663574092747
elpd_dic = log_avg - (2*(log_avg - avg_log))
DIC = -2 * elpd_dic
1421.1899397574339
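The prompt also asks for the AIC, which isn't printed above. A minimal sketch, assuming the maximized log-likelihood is available from the fit (the ModeResult above reports an lp of -Inf, so the optimizer may need better starting values before this is meaningful); k = 3 for the stationary GEV (μ, σ, ξ) and k = 4 for the nonstationary variant:

```julia
# AIC = 2k - 2 log L(θ̂), where k is the number of free parameters and
# log L(θ̂) is the maximized log-likelihood (e.g. the lp of a converged ModeResult)
aic(k, loglik) = 2k - 2loglik

# hypothetical maximized log-likelihood, for illustration only
aic(3, -708.0)  # → 1422.0
```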
Conduct relevant graphical checks you can think of. These could include return periods, credible intervals for the data, or any other statistics that seem appropriate. Explain what each check is for and what your conclusions are about the model based on them. What would you change about the model in 1.1, if anything, based on these checks?
function predict_tide(chain, rp)
# the Array(group()) syntax is more general than we need, but will work if we have multiple variables which were sampled as a group, for example multiple regression coefficients.
μ = Array(group(chain, :μ))
σ = Array(group(chain, :σ))
ξ = Array(group(chain, :ξ))
#Create GEV Distribution samples
GEV_samples = GeneralizedExtremeValue.(μ,σ,ξ)
GEV_quantiles = quantile.(GEV_samples, (1-1/rp))
return GEV_quantiles
end
predict_tide (generic function with 1 method)
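The `quantile.(GEV_samples, 1 - 1/rp)` call corresponds to the standard closed-form GEV return level for return period $T$ (for $\xi \neq 0$):

$$z_T = \mu + \frac{\sigma}{\xi}\left[\left(-\log\left(1 - \frac{1}{T}\right)\right)^{-\xi} - 1\right].$$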
GEV_50 = predict_tide(chain, 50)
GEV_100 = predict_tide(chain, 100)
GEV_500 = predict_tide(chain, 500)
20000-element Vector{Float64}:
1569.726483392688
1681.2460667010464
1538.9929737891125
1858.7121135219022
1632.6764379589988
1653.3821628199275
1605.1214490969414
1643.1603684235615
1651.5694255116748
1643.7823208033296
⋮
1697.2800288021583
1587.4010666115137
1587.4010666115137
1672.9455316431736
1675.3999654658387
1585.3614667363315
1655.9959543706282
1629.7058545279008
1619.548738600544
density([GEV_50 GEV_100 GEV_500], lw = 3, label = ["50-yr" "100-yr" "500-yr"])
xlabel!("Return Level")
ylabel!("Density")
param_array = vcat(μ', σ', ξ')
param_low = quantile.(eachrow(param_array), 0.025);
param_hi = quantile.(eachrow(param_array), 0.975);
gev_low = [rand(GeneralizedExtremeValue(param_low[1], param_low[2], param_low[3])) for _ in 1:length(dat_annmax.residual)]
gev_hi = [rand(GeneralizedExtremeValue(param_hi[1], param_hi[2], param_hi[3])) for _ in 1:length(dat_annmax.residual)]
126-element Vector{Float64}:
1380.1047649686989
1230.4910463417737
1289.579295172305
1222.1229638007362
1350.3598025432084
1381.063561636429
1359.907482399661
1329.541989980919
1336.0247267703285
1215.3086854086719
⋮
1596.7979067349142
1259.4610850692777
1718.6186634677772
1277.3937222655313
1283.8667199802223
1330.287806631481
1244.322626823678
1654.4114248877136
1601.7988460268318
avg_param = [rand(GeneralizedExtremeValue(post_means[1], post_means[2], post_means[3])) for _ in 1:length(dat_annmax.residual)]
126-element Vector{Float64}:
1413.5118354423369
1216.7156547231243
1272.911149043987
1238.09852233632
1358.9311502839678
1279.100698329516
1324.2337255463272
1280.5349222329062
1344.1018446437693
1240.6212131643965
⋮
1225.0524685908986
1579.1617390575489
1414.8372899493722
1521.8731767994718
1419.1616782839922
1454.0177343827663
1259.6525266483557
1329.345707266223
1291.3912435394564
histogram(dat_annmax.residual, normalize = :pdf, label = "data")
density!([gev_low gev_hi], label = false, line = :dash, lw = 2.5)
density!(avg_param, label = "avg param", lw = 2.5)
Here, I compare the original distribution of extreme tide gauge values with the posterior distribution computed from the posterior-mean parameter values. Based on this graph, the posterior follows the original distribution well, peaking around the same value. The dashed curves show samples drawn at the 2.5% and 97.5% posterior parameter quantiles, a rough stand-in for a 95% predictive interval. The original data are bounded by the interval, although the lower-quantile curve has an extended tail reaching upper extremes not present in the original data. This could be due to a few outliers in the original annual maxima, which could skew the fit toward higher extremes. I may try removing outliers from the original dataset to see whether my posterior interval is better constrained.
Next, we will model the tidal extremes using a non-stationary GEV distribution, where the location parameter (but not the shape or scale) is represented by a linear regression $$\mu_t = \mu_0 + \mu_1 x_t,$$ where $x_t$ is the annual mean Pacific Decadal Oscillation (PDO) index, which is based on the variability of sea-surface temperatures (SSTs) in the North Pacific (versus the El Niño–Southern Oscillation (ENSO), which emphasizes the equatorial SSTs).
First, let's load the PDO index dataset from NOAA.
## load the data from the file and return a DataFrame of DateTime values and gauge measurements
function load_pdo(fname)
# This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
df = CSV.read(fname, DataFrame; delim=" ", header=2, ignorerepeated=true)
# take yearly average
@transform!(df, :PDO = mean(AsTable(names(df)[2:13])))
@select!(df, $[:Year, :PDO])
@rsubset!(df, :Year != 2023)
return df
end
pdo = load_pdo("data/ersst.v5.pdo.dat")
# subset for years that match the tide gauge data
years = dat_annmax[!, :Year]
@rsubset!(pdo, :Year in years)
| Row | Year | PDO |
|---|---|---|
| | Int64 | Float64 |
| 1 | 1897 | 0.140833 |
| 2 | 1898 | 0.0933333 |
| 3 | 1899 | -0.351667 |
| 4 | 1900 | 0.806667 |
| 5 | 1901 | -0.185833 |
| 6 | 1902 | 1.22167 |
| 7 | 1903 | 0.281667 |
| 8 | 1904 | 1.16 |
| 9 | 1905 | 1.3825 |
| 10 | 1906 | 0.810833 |
| 11 | 1907 | 0.275 |
| 12 | 1908 | 0.73 |
| 13 | 1909 | -0.711667 |
| ⋮ | ⋮ | ⋮ |
| 115 | 2011 | -1.81167 |
| 116 | 2012 | -1.73333 |
| 117 | 2013 | -1.16583 |
| 118 | 2014 | 0.554167 |
| 119 | 2015 | 0.920833 |
| 120 | 2016 | 0.673333 |
| 121 | 2017 | -0.0933333 |
| 122 | 2018 | -0.365 |
| 123 | 2019 | -0.154167 |
| 124 | 2020 | -1.145 |
| 125 | 2021 | -1.87417 |
| 126 | 2022 | -2.115 |
pdo_plot = @df pdo plot(:Year, :PDO)
xaxis!(pdo_plot, "Year")
yaxis!(pdo_plot, "Temp Anomaly")
Construct a Turing.jl nonstationary model with appropriate priors (there's no right choice; remember that thinking about the priors is part of the model checking process). Find the MLE estimate (you'll need this for the AIC) and use MCMC to sample from the posterior.
@model function anmax_nonstat(y, temp)
## pick priors
μ₀ ~ Normal(1250, 200) # location
μ₁ ~ Normal(0, 1) # trend coefficient on the PDO index
σ ~ truncated(Normal(0, 100); lower=0) # scale
ξ ~ Normal(0, 10) #shape
## likelihood
for t in 1:length(y)
y[t] ~ GeneralizedExtremeValue(μ₀ + (μ₁ * temp[t]), σ, ξ)
end
end
anmax_nonstat (generic function with 2 methods)
gev_nonstat = anmax_nonstat(dat_annmax[!, :residual], pdo[!, :PDO])
DynamicPPL.Model{typeof(anmax_nonstat), (:y, :temp), (), (), Tuple{Vector{Float64}, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(anmax_nonstat, (y = [1247.5013623978202, 1237.0217983651228, 1471.4986376021798, 1276.6158038147137, 1282.4277929155314, 1165.632152588556, 1180.5858310626704, 1221.8882833787466, 1240.6376021798364, 1175.3433242506812 … 1303.9727520435968, 1264.4059945504086, 1256.0027247956405, 1355.9618528610354, 1261.7356948228885, 1273.0027247956405, 1251.5558583106267, 1273.0653950953679, 1227.7574931880108, 1320.359673024523], temp = [0.14083333333333334, 0.09333333333333334, -0.3516666666666666, 0.8066666666666666, -0.18583333333333332, 1.221666666666667, 0.2816666666666667, 1.1600000000000001, 1.3825, 0.8108333333333332 … -1.1658333333333335, 0.5541666666666666, 0.9208333333333334, 0.6733333333333333, -0.09333333333333334, -0.365, -0.15416666666666667, -1.1449999999999998, -1.8741666666666665, -2.1149999999999998]), NamedTuple(), DynamicPPL.DefaultContext())
chain_ns = sample(gev_nonstat, NUTS(), MCMCThreads(), 5000, 4, drop_warmup=true);
@show chain_ns
┌ Info: Found initial step size (ϵ = 0.2 for all four chains)
└ @ Turing.Inference /Users/pbb62/.julia/packages/Turing/Suzsv/src/inference/hmc.jl:190
chain_ns = MCMC chain (5000×16×4 Array{Float64, 3})
Sampling (4 threads) 100%|██████████████████████████████| Time: 0:00:18
Chains MCMC chain (5000×16×4 Array{Float64, 3}):
Iterations = 1001:1:6000
Number of chains = 4
Samples per chain = 5000
Wall duration = 15.74 seconds
Compute duration = 57.98 seconds
parameters = μ₀, μ₁, σ, ξ
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rh ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float ⋯
μ₀ 1258.5392 5.6958 0.0443 16526.7272 15903.2867 1.00 ⋯
μ₁ -0.3609 0.9872 0.0132 5616.7563 5284.4575 1.00 ⋯
σ 57.3398 4.2324 0.0330 16598.0711 14470.3749 1.00 ⋯
ξ 0.0282 0.0625 0.0006 12997.2434 11966.1655 1.00 ⋯
2 columns omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
μ₀ 1247.6837 1254.6494 1258.4497 1262.3567 1269.7775
μ₁ -2.2592 -1.0448 -0.3605 0.3163 1.5803
σ 49.6640 54.4113 57.0824 60.0495 66.3553
ξ -0.0842 -0.0161 0.0242 0.0683 0.1614
plot(chain_ns)
corner(chain_ns)
Compute AIC and DIC for the nonstationary model.
##Calculate DIC for MCMC
#Calculate parameter averages
post_means_ns = mean(chain_ns)[:, :mean]
μₜ = [post_means_ns[1] + (post_means_ns[2] * temp) for temp in pdo[!, :PDO]]
log_avg_ns = sum([logpdf(GeneralizedExtremeValue(μₜ[t], post_means_ns[3], post_means_ns[4]), dat_annmax.residual[t]) for t in 1:length(dat_annmax.residual)])
#Calculate average log-likelihood
μ_0 = Array(group(chain_ns, :μ₀))
μ_1 = Array(group(chain_ns, :μ₁))
σ_ns = Array(group(chain_ns, :σ))
ξ_ns = Array(group(chain_ns, :ξ))
log_like_ns = zeros(length(μ_0))
for step in 1:length(log_like_ns)
log_like_ns[step] = sum([logpdf(GeneralizedExtremeValue(μ_0[step] + (μ_1[step] * pdo[!, :PDO][t]), σ_ns[step], ξ_ns[step]), dat_annmax.residual[t]) for t in 1:length(dat_annmax.residual)])
end
avg_log_ns = mean(log_like_ns)
-709.0546261991748
elpd_dic = log_avg_ns - (2*(log_avg_ns - avg_log_ns))
DIC_ns = -2 * elpd_dic
1421.0043096549703
Conduct relevant graphical checks you can think of. These could include return periods, credible intervals for the data, or any other statistics that seem appropriate. Explain what each check is for and what your conclusions are about the model based on them. What would you change about the model in 1.1, if anything, based on these checks?
param_array = vcat(μ_0', μ_1', σ_ns', ξ_ns')
param_low_ns = quantile.(eachrow(param_array), 0.025);
param_hi_ns = quantile.(eachrow(param_array), 0.975);
u_low = [param_low_ns[1] + (param_low_ns[2] * temp) for temp in pdo[!, :PDO]]
u_hi = [param_hi_ns[1] + (param_hi_ns[2] * temp) for temp in pdo[!, :PDO]]
gev_low_ns = [rand(GeneralizedExtremeValue(u_low[t], param_low_ns[3], param_low_ns[4])) for t in 1:length(dat_annmax.residual)]
gev_hi_ns = [rand(GeneralizedExtremeValue(u_hi[t], param_hi_ns[3], param_hi_ns[4])) for t in 1:length(dat_annmax.residual)]
126-element Vector{Float64}:
1327.5493756281717
1222.3682746406369
1266.6686564721663
1274.0461061971532
1327.1879478692508
1345.8325657213609
1258.272115640347
1193.3738957222135
1282.731275647149
1364.0075309615966
⋮
1254.9061951254414
1529.8368099064812
1319.2458685005543
1292.6609825401097
1267.8147336432498
1589.1623105789472
1465.8816482875816
1358.1708034078363
1195.4529747990052
avg_param_ns = [rand(GeneralizedExtremeValue(μₜ[t], post_means_ns[3], post_means_ns[4])) for t in 1:length(dat_annmax.residual)]
126-element Vector{Float64}:
1305.5568575858347
1258.0580900002033
1250.9301310258884
1276.2369021202085
1366.2568063398548
1213.8431098904937
1199.350644369796
1258.6597624686783
1350.5851950054946
1278.4915400841546
⋮
1317.2619780415341
1287.2938494082814
1281.666905335983
1247.4391243209677
1307.97880042882
1271.6535104016807
1321.9211327002101
1360.746028819093
1327.7188743444085
histogram(dat_annmax.residual, normalize = :pdf, label = "data")
density!([gev_low_ns gev_hi_ns], label = false, line = :dash, lw = 2)
density!(avg_param_ns, label = "avg param", lw = 2.5)
Once again, I compare the tide gauge data with a sample of values drawn from the posterior-mean GEV distribution. The density follows the general pattern of the original data, peaking near the same value. The interval captures the data reasonably well, but it seems to underestimate values in the middle of the distribution while overestimating the tails. However, these curves are based on a single sample from each distribution, so the interval is probably not robust. I would continue to investigate the outliers in my original data from 1.1.
Based on the information criteria and your graphical checks, what do you think is the relative evidence for dependence of the San Francisco tide gauge extremes on the PDO? Can you draw a conclusion about which model is better? What else could you try?
The DIC for both models is around 1421; such similar values do not indicate strong evidence for either model. Additionally, both posterior distributions follow the original data quite well, and the 95% credible interval for μ₁ spans zero. Overall, I don't see evidence to suggest that including the PDO is necessary when modeling the tide gauge extremes. Both models perform similarly: the stationary GEV seemed to capture the original data slightly better, while the nonstationary model had better-constrained tails. From a calibration standpoint, I would probably opt for the stationary GEV model, but if I were more interested in modeling the major extremes of my data, the nonstationary model could be more helpful.
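One quick numeric check alongside the information criteria: the 95% credible interval for μ₁ reported in the summary above (roughly -2.26 to 1.58) contains zero, consistent with weak evidence for a PDO effect. A sketch of that check (synthetic draws stand in for `vec(μ_1)` from the chain):

```julia
using Statistics

# synthetic stand-in for the posterior draws of μ₁, matching the reported
# posterior mean (-0.36) and std (0.99); in the notebook use vec(μ_1) instead
draws = 0.99 .* randn(20_000) .- 0.36
ci = quantile(draws, [0.025, 0.975])
ci[1] < 0 < ci[2]  # does the 95% credible interval contain zero?
```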